Data Renegades
57 MIN

Ep. #2, Data Journalism Unleashed with Simon Willison

about the episode

In episode 2 of Data Renegades, CL Kao and Dori Wilson speak with Simon Willison. Together they dive into the origins of Datasette, the evolution of data journalism, and the surprising ways open source tools shape global reporting. Simon also explains how LLM-based agents will redefine data cleaning, enrichment, and analysis. A must-listen for anyone building or scaling data teams.

Simon Willison is an engineer, open-source maintainer, and co-creator of the Django web framework. He is the creator of Datasette, an ecosystem of tools designed to make publishing, exploring, and working with structured data accessible to everyone. His work spans data journalism, AI-assisted developer tools, and scalable knowledge-sharing systems used by newsrooms, researchers, and organizations worldwide.

transcript

Dori Wilson: Hi, welcome to another episode of Data Renegades. I am one of your hosts, Dori Wilson, the Head of Data and Growth at Recce.

CL Kao: And I'm CL, CEO of Recce. Today our guest is Simon Willison. I've known Simon since before he co-created Django, the popular Python web framework. We'll dive into Simon's passion for data journalism and the Datasette project, along with his pioneering work with LLMs.

For years now, his blog posts have probably hit the Hacker News front page more than anyone else's, and his takes on vibe engineering are changing the way we work with data.

Hello Simon. Welcome to the podcast.

Simon Willison: Hi. It's really great to be here.

CL: So take us back to the beginning. You started as a web developer and built Django, a very popular framework. But what problem first pulled you into the data space?

Simon: I mean, I've always been really interested in the data space from the point of view of journalism.

There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering.

And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist. You're telling stories about the world. So many stories have data components behind them.

And when we got started with Django, that was actually a project that we built at a local newspaper in Lawrence, Kansas. I had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty. And at the time we thought we were building a content management system.

We were building this piece of software that we called The CMS, and the idea was to make it as productive as possible to build out this tiny little local newspaper. But at a local newspaper, it turns out, there's a lot of data, right? There's every band in town, every event that's happening, every restaurant. There's all of this stuff about your little town which, if you stick it in a relational database, you can do really cool things with.

And so we weren't really thinking in terms of anything more than what it takes to serve a small town, but even in a small town, having rich, well-constructed database tables that cover the array of things going on is super useful.

My favorite example of that is something that Adrian built. We had a database of music venues in town and all of the bands that were playing, and this was in 2003, 2004. But the town had quite good broadband so we could serve up MP3 recordings of the different bands. And Adrian built a feature of the site called the Downloads Page.

And what it did is it said, okay, who are the bands playing at venues this week? Then it would construct a little radio player of MP3s from bands who are playing in Lawrence that week. So you can listen to the music you might hear, and it shows you, "Hey, and you can see this band at the Bottleneck on Thursday." That's so cool.

Dori: That is really cool. Yeah.

Simon: Right? That is such a cool thing to build. And this is 1970s relational database technology, right?

Dori: Oh my god.

CL: Wow.

Simon: Yeah.

There's nothing cutting edge about this. But it really showed that if you structure that data, even at the scale of a tiny little college town in Kansas, you can build really cool features on top of it.

So Django, the open source framework, actually came out of that newspaper about six months after I left my internship. They got the go-ahead to open source it, and it's since been used by Instagram and Pinterest and NASA and all sorts of places like that. We had no idea, right? We thought we were building a CMS for a small-town newspaper.

Dori: Literally from the heartland.

Simon: Absolutely. Right? And a lot of my work since then has been in data journalism. I worked for the Guardian newspaper for a few years doing data-driven reporting projects there.

And I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process.

CL: Indeed. I mean, I've had my fair share of poking at data journalism, working with campaign finance data and legislation data. What are the most impactful data journalism stories you can think of?

Simon: One of my absolute favorites was something the Washington Post did a few years ago, where they used freedom of information requests to get hold of every opioid prescription that had been made across the US for the entirety of the opioid crisis. This was millions and millions of rows of data. I think it was over a hundred gigabytes that they collected.

And so they were able to tell stories about, look, here are tiny little towns that are getting more opioid prescriptions than a town ten times their size should be getting. But something the Washington Post did that I thought was extremely forward thinking is that they shared that data with other newspapers.

They said, "Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?" And they put a lot of work into APIs and interfaces for local reporters at local papers to be able to dig into just that section of the data.

I love that kind of stuff, the sort of collaborative data journalism, where it takes a lot of work to gather and clean up and make this stuff available. It's a very noble thing to then spread it out and make sure that those smaller local publications can work off the same data.

CL: Yeah, it's almost like bringing open source to journalism, right?

Simon: Absolutely. There's a lot of crossover there. It's really interesting, because not many news organizations can afford data reporters. The Washington Post and the New York Times, and actually the San Francisco Chronicle, have very good, albeit small, teams for this. But there's a lot of recognition in the industry that we need these capabilities to go beyond just these large publications.

And it's a very generous sector. There's an annual conference I go to called NICAR, from the National Institute for Computer-Assisted Reporting, which is what they've been calling data journalism since, I think, the seventies, when they were doing it with punch cards and mainframes. This conference pulls a thousand people a year, and those thousand people are a huge mix of practitioners and journalists from big and small publications. And they just share tips on everything, right?

There's no competition in this space. It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole.

CL: I love that, where the data representing the facts is now center stage in what we're doing.

Dori: Yeah, my first job and how I got to San Francisco was working as an economics research associate at the SF Fed. So very much love, love using numbers and analytics to help make sense of the world.

Simon: I mean, that's effectively data journalism. That's exactly the same activities as data journalism, just in a slightly different shape.

I build open source software for data journalism, and my cunning plan here is: journalists have no money, so I'll give it to them for free. Everyone else in the world with money needs the same tools. There's nothing I can build for a journalist to help them find stories in data that isn't economically valuable to everyone else.

Dori: I agree. That was one of the things I realized about econometrics when I switched into tech: oh, it's still stats, it's still probabilistic, it's still inference, we're just using it for a different purpose at a different scale. A little less noble when you switch from the public sector into the private sector.

Is there anyone you think is doing a really good job? You've mentioned the Chronicle, you've mentioned WaPo. Are there any independent journalists or local journalists you think are really good at this?

Simon: Yes. The really interesting trend in the United States in the past 15 to 20 years has been the rise of the nonprofit newsrooms. They're a source of hope, because the business model for journalism is in pretty bad shape at the moment, but there's a whole new set of nonprofit newsrooms doing fantastic work.

ProPublica I think are one of the best examples of this. They're less than 20 years old. They have an amazing data team and they break data-driven stories all the time. They're absolutely worth paying attention to.

The other one, and I love promoting this one because they're such a breath of fresh air and good news in an industry full of bad news, is an organization called the Baltimore Banner, which only started, I think, three or four years ago. They were a sort of splinter group from the Baltimore Sun, and the Baltimore Banner is a nonprofit newsroom.

They have a hundred employees now, for the city of Baltimore. It's a very healthy newsroom. They do amazing data reporting, they're doing fantastic social media stuff. They've got this incredible focus on Baltimore and Maryland; that's the scope of what they're doing. And I believe they're almost breaking even on subscription revenue, which is astonishing.

They have some big philanthropy donors and things, but a new newsroom with a hundred people, and I believe they just won their first major reporting award recently as well. They've got, again, a small data team that punches way above its weight. I'm so thrilled by that. That's exactly the kind of publication we need to see more of.

Dori: And they're beholden to their subscribers, their audience, and their community. That's really cool.

CL: Right. You mentioned your plan is to give these tools away so that data journalists with no budget can use them. Is that what led you to create Datasette?

Simon: So the original spark was back in, I think, 2017. I was thinking a lot about serverless hosting providers. Things like Vercel and Cloud Run hadn't even come out yet, but Heroku had been around for ages, and these providers were getting really inexpensive, right? They all had free tiers where you could spin up a web application. But the one thing you weren't allowed to do was run a database on them.

Because the whole thing with these serverless hosting providers is that they're stateless, right? Heroku has always been stateless. You don't get a file system that you can write to, but anything you can do without writing back to the file system still works.

A few years before, back around 2009, 2010, I was working at The Guardian, and one of the initiatives I was involved in was called the Guardian Data Blog. We had realized that we were collecting lots and lots of data for the infographics in the newspaper, right? Anytime a newspaper has a pretty map, somebody had to go out there with Excel and collect the numbers.

Wouldn't it be great if we published those numbers as well? So at The Guardian, we started doing that with Google Sheets. We're like, okay, the numbers behind the graphics are available in a Google Sheet, and we'll stick them up on our data blog. And it was really successful. We had a whole community of people on Flickr building extra visualizations against Guardian data and stuff.

But I was always frustrated that Google Sheets was the state of the art for sharing data on the internet. That didn't feel good enough, you know? And then I had, it was literally a shower revelation. I was in the shower thinking about serverless and I thought, "Hang on a second. You can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?"

Turns out that works incredibly well. So the first idea for Datasette was let's build something where you can take a SQLite database full of interesting information about the world, ship it to a hosting provider with a simple UI on the top that lets people filter through the tables and run a few searches and run SQL queries and bundle the whole thing and make it as easy to deploy as possible. And that works. And that's kind of cool.
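
That workflow, in miniature: build a SQLite file from structured data, then ship it somewhere read-only. A rough sketch using Simon's sqlite-utils library, with an invented CSV file and table name:

```python
# Sketch: bake structured data into a read-only SQLite file.
# "bands.csv" and the "bands" table are invented for illustration.
import csv
import sqlite_utils

db = sqlite_utils.Database("bands.db")
with open("bands.csv") as f:
    db["bands"].insert_all(csv.DictReader(f))

# The resulting file can then be deployed as a static asset with the
# Datasette CLI, e.g.: datasette publish cloudrun bands.db --service=bands
```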

In the past I've thought about it like this: Pinterest solved scrapbooking and WordPress solved blogging, so who's going to solve publishing tables full of data on the internet? That was my original goal.

And then, Datasette's written in Python, and it's very easy to add plugins to Python programs. So I added this plugin system to Datasette, and that unlocked everything, because now I'm thinking, okay, maybe I could write to the database if I have a plugin that helps set up full-text search, or lets you upload a CSV file. I can have plugins for visualizations on maps, or plugins for bar charts, and all of that kind of stuff.

So Datasette today has well over 150 plugins. Most of them were written by me, but not all of them, which is important. And I like this idea of sort of building a multi-tool. My ultimate goal is any problem you have with data, be it how to visualize it, how to explore it, how to clean it up, there should be a Datasette plugin that lets you do that. And it's built on top of SQLite, which is an incredibly robust and flexible data platform to build almost anything on top of.
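
For a flavor of what a plugin looks like: Datasette plugins are implemented as hook functions. A minimal sketch using the real render_cell hook; the particular flagging behavior here is invented:

```python
# Minimal Datasette plugin sketch. The render_cell hook is part of
# Datasette's plugin API; this specific behavior is invented.
from datasette import hookimpl

@hookimpl
def render_cell(value):
    # Datasette calls this for each table cell it renders; returning
    # None falls back to the default rendering.
    if value == "ERROR":
        return "⚠️ ERROR"
    return None
```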

CL: Right, so it started by solving the data distribution problem, and then allowing casual use, like filtering through the data, but then the plugin ecosystem enables all the other things. What are some unintended use cases you've seen for Datasette?

Simon: One, I think it might have been Copenhagen, there was a city in Europe that used Datasette for their electricity grid information, which was very exciting. I mean, who would've thought, right? So that was a fun one.

There's also somebody who was doing research on the Brooklyn Cemetery. They got hold of the original paper files of who was buried there, digitized them, loaded the results into Datasette, and now it tells the story of immigration to New York. Because you can ask, oh, how many people who were born in Poland are now buried in the Brooklyn Cemetery, and in what years, and so forth. That was a really exciting one.

Then, possibly the most impactful one, there's an organization called Bellingcat who do reporting out of Eastern Europe, and especially on what's happening in Russia. Interestingly, Bellingcat were an open source reporting project, which is the other definition of open source. It means openly available sources of information. And it turns out the intelligence community use that term long before the software engineers started using the term open source.

But Bellingcat, they often get leaked data and somebody hacked into the Russian equivalent of DoorDash, the food delivery service, and leaked the entire database to Bellingcat. And it turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food.

CL: Oh gosh.

Simon: And so they were using my software to let all of their reporters search that data and correlate it against other leaked databases. And I'm like, "Wow, that's going to get me thrown out of a window, that one." That's a very exciting application of the software.

Dori: Oh my gosh. But no, it's cool, because it's like the Pentagon pizza story: when you see them ordering a bunch of pizza, there's a whole index around that, like, oh man, we're about to do something sketchy.

Simon: Exactly. Yeah, stuff like that. That's the best and the worst thing about open source. The best thing is that anyone in the world can use it. The worst thing is they never tell you. I heard about the Bellingcat thing because somebody tipped me off that they'd mentioned it on their podcast.

CL: Wow.

Simon: And that's a frustration I've had. You can release all of this software and get silence, and then find out later that people have been using it all over the place. I go to this journalism conference mainly for the corridor track, where people go, "Oh hey, we're using Datasette at the Wall Street Journal to track CEO compensation," which they are. That's amazing. I had no idea.

I do this thing on Fridays where I have an open office hours Calendly, and the invitation is: if you use my software, or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people. That's how I learned about the Brooklyn Cemetery project: somebody signed up for office hours and showed me what they were doing.

An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it.

Dori: Yeah.

CL: Definitely. And then, I'm trying to think. Datasette solved the presentation and publication side, and people are able to collaborate on that now, right? But there's the curation, the data cleaning, all the data engineering that needs to happen before the data is made available.

So what do you think is the hardest part of data engineering today, the part that, given we have all those tools now, nobody's talking about?

Simon: So I feel like some of it is around best practices and habits. I used to work for a large company that had a whole separate data division, and I learned at one point that they weren't using Git for their scripts. They had Python scripts littering laptops left, right, and center, lots of notebooks, and very little version control, which upset me greatly.

But it is important, right? Having really good habits about keeping that pipeline documented and clean and repeatable is still a problem. There are tools out there, but they're not necessarily widespread.

So there's a lot of basic fundamentals here. There's an organization called The Carpentries; basically, they teach scientists to use Git. Their entire thing is: scientists are all writing code these days, but nobody ever sat them down and showed them how to use the UNIX terminal, or Git, or version control, or how to write tests. We should do that.

They do that, and I love it. I love that there's an organization out there trying to spread these software engineering fundamentals that you won't necessarily pick up if you don't go through a traditional software engineering career progression. So that, I feel, is really important.

Dori: Yeah. I didn't use Git until I got my first job in tech, even though in econ we were doing a lot of coding. We were deep in R; we wrote a lot of R scripts and our own packages at the Fed to use internally. We didn't have version control, and it definitely bit us a couple of times: something was stored locally and never pushed to a shared drive, and then the person left and their computer got wiped, and now you have to recreate all of their functions just from a name.

Simon: Right. And to be fair, Git is notoriously difficult to use, right? It's a fabulously obtuse piece of software, but it does unlock so much if you can build those organizational habits around it. So there's that side of things. And every single person I talk to in data complains about the cleaning. Everyone says, "I spend 95% of my time cleaning the data and I hate it."

That's one of the things I'm really interested in exploring: applying agentic coding tools to that data cleanup step. Because if you can make sure it's well documented and explained, and you can undo things when they go wrong and so forth, there's so much value we can get out of that.

CL: I agree. And I feel like every paradigm shift starts with being able to do something 10x faster, and that redefines the thing that used to encompass it. We touched on this the other day: data cleaning is actually making a lot of attempts to transform the data and verifying them, right? It's making changes and verifying them, and we can now make changes ten times faster with the help of LLMs.

So the question, and the bottleneck, is really how we verify the result. I'm thinking there should be a better way to do data cleaning these days, one that makes that 95% of people happier. What do you think?

Simon: I'm very confident there is; I'm trying to build it at the moment. For journalists as well. Do you really want to tell journalists that they have to spend three weeks cleaning up all of their data before they can start reporting on it? That's not reasonable.

You know, one of the fun things about working in journalism is that it's very deadline driven. If it takes you six months to do an analysis, the news cycle has moved on. You need to be able to turn things around. In fact, Django's original tagline was "web development for perfectionists with deadlines," which I loved. That was such a great way of capturing the spirit of the project.

CL: Okay, so we've talked about the tooling, getting the data ready, and publishing it. But if you had to rebuild one piece of the data stack from scratch, what would you pick to do differently?

Simon: One of the problems I've seen the most pain from is actually around data documentation. At a company I worked for, we had one product launch that had to be canceled at the last minute, because it turned out the data reporting we'd done to support the project, where we predicted how much money it would make, had missed a fundamental part of the data model. The data analyst who did it hadn't understood a hundred percent of the data model they were working with.

They missed something that caused a 10x difference in the estimated cost of the project. That kind of thing is terrifying. And I realized at the time that we had a software engineering team who were mucking around changing database schemas all of the time, and a data reporting team who worked off the data warehouse copy of that data. Changes that the engineers made weren't captured in a change log; there was very poor practice around documenting those changes.

A coworker of mine said, "You do realize that this should be a documented API interface, right?" The data warehouse view of your project is something you should be responsible for communicating to the rest of the organization, and we weren't doing it. And that feels super important to me, because every production database is changing; the schema changes all the time.

The fact that there are all of these downstream customers you might not know about is a classic API design problem, but I don't think most data teams think in terms of API contracts they're making with their users. So I think there's a lot that could be thought about more deeply in that part of the world.

CL: Yeah, I feel the industry has had several iterations on that problem. People call it a general data quality problem, or a data contract problem, or a data culture problem. But essentially, the context for the data wasn't aligned across the organization. Do you feel it's more a tooling issue or a cultural issue?

Simon: I'm going to say both. I feel like it's absolutely a culture issue, but good tools would use friction in a way that encourages the culture in a certain direction. So I feel like there's a lot of scope for tooling to help smooth that. It's that cow path thing: if you smooth the right cow paths, people will naturally head in those directions.

A very simple version of this, and something I used to bang the drum about all the time: if you show somebody a report, you need to have "view source" on that report. At my previous employer, somebody would say 25% of our users did this thing, and I'm thinking, I need to see the query, because I knew where all of the skeletons were buried. Often that 25% was actually 50%, because maybe they're double counting certain users, or ignoring users with the "is deleted" flag, or users that haven't been active. There's so much like that.

So that, I feel, is a cultural thing and a tooling thing. I never want to see a business report that doesn't give me a way to get to the queries and actually audit and see how that report was produced.

Dori: Do you think that's because you're a more technical stakeholder, though, someone who's more data literate?

Simon: Yes, a hundred percent. But at the same time, I feel like you've got to give people a fighting chance. This is something the very best newspapers do. I've talked to the data reporting team at Reuters, and they treat data exactly the same way they treat a story. Their stories are fact checked; no story goes out the door without someone else fact checking it and without an editor approving it.

And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers, and maybe even produce those numbers themselves in a separate way, before they're confident enough to publish them. And this is a world-class news organization doing this.

But I feel like, you know, in a corporate setting, sure, there might not be that many data literate people, but there's more than one. So if you make sure that anyone who's data literate has the chance to at least go, "Hang on a second, is that right?" and look into it, that feels healthy to me.

Dori: Well, and definitely for increasing trust. Like you were saying: this number, I don't trust it, but if you can go and look at the query or the source table, you can be like, "Oh, actually this does look correct, I can verify it."

Simon: There's also a lot to be said for sign-offs. I'd love to see a report that says, "Oh, and three other data reporters have approved this." Stuff like that would help me a lot personally.

Dori: Yeah. Most teams aren't going to get to API-style documentation yet, because this is, at heart, a people problem a lot of the time. What's something practitioners could do today to start moving their team in that direction?

Simon: The big one is having good data dictionaries. A table with mysterious columns and weird acronyms that isn't explained somewhere is a huge problem. The challenge with documentation is always that maybe you've got a wiki, and the problem with the wiki is that I go to it and it says "last updated six months ago." And I think, well, clearly it's a waste of my time to read this documentation. I've solved that for code.

The way you solve it for source code is you keep the documentation in the same repository as the code, and in your code review process you have a hard block: you cannot ship a change that makes the documentation out of date, even if it's just a comment saying "make sure you update this documentation." And I found that if you do that, within a few months everyone learns that the documentation is trustworthy, because they've all had their changes blocked. So they think, "Okay, now we know that the documentation is in good shape, I can start trusting it."

I just don't know what that looks like for data, because data reporting doesn't quite fit into that same model. I mean, if all of your SQL is in a GitHub repository with your markdown documentation, I think that's ideal. But that's probably not how most teams are operating.

CL: Yeah, I think software modularity helps, like putting the code and documentation together. For data, dbt is a great start, in that metadata and transformation logic live in the same place. But usually you have more than that, right?

You have all your weird ingestion pipelines somewhere, and then maybe another governance tool that needs to sync all this documentation. Which one is the real one, then?

Simon: I think one thing, and this is a bet I'm taking with features I'm building into Datasette, is that the queries themselves need to be first-class citizens. I want to see a library of queries that my team is using, and for each one I want to know who built it and when it was built.

And I want to see how it's changed over time, and be able to post comments on it, and see something that says, "John says: hey, watch out, this query's column has changed its meaning," or whatever. So again, this is software that drives culture.

Something about software that helps with that sort of collaboration, revealing what those queries are and making sure people can see that it's a living, breathing system, feels to me like it should help.

Dori: Yeah. Some of the things I've done with my teams: like CL and you mentioned, having the markdown file in the dbt repo is so easy to do. At one of my companies, we had blocks in place to build our documentation: if you're touching something that isn't documented, you have to document it, no matter how big the PR was. That was one way to do it.

But then a lot of my downstream stakeholders, who are very data dependent, and more or less data literate, aren't code literate. So they take the data you have in a version-controlled environment, even if it's a BI tool, and they put it into a sheet to manipulate it, to then put into a board deck, right? Or to present about the function of their product.

So that's one of the things I've seen, and I'm curious if you have thoughts there. Getting it into code is one thing, and teams still struggle with that, but that's not always where the data is being used. Sometimes it's just where it's being structured.

Simon: I feel like the moment it gets to Excel, you lose all control. It's in a new world, right? And I find it fascinating. On the one hand, Excel is terrible for version control and visibility and all of that kind of stuff. On the other hand, the entire world economy runs on Excel spreadsheets, and it just about manages to limp along and work.

When people complain about vibe coding, I like to point out that we've been running everything off of un-version-controlled random spreadsheets, and we've somehow managed to survive all of the crises that come out of that.

It's tricky. And that's where it becomes a cultural thing. You want to make sure that people have really good habits, that they're sharing those Excel spreadsheets somewhere where other people can dig in and see how they work. I feel like that's the hardest part of this stuff. That end of things is where it gets most complicated.

CL: Right. We've talked a bunch about the cultural side of how data work gets done. What's the best advice you'd give to someone just building out their first data team?

Simon: So I'm really keen on a culture of documentation, and one of the things I've figured out about documentation is that there are various different types. There's the type of documentation which is the official docs. Those must be kept holy: they must be good and they must be up to date, because if people lose trust in the official documentation, it takes years to earn that trust back.

But there's another type of documentation, which I call temporal documentation, where effectively it's stuff where you say, "Okay, it's Friday, the 31st of October, and this worked." The timestamp is very prominent, and if somebody looks at it in six months' time, there's no promise that it's still going to be valid.

The way this works, when I've worked with companies, is I've basically had an internal blog, right? You have a little internal blog where you say, here's the SQL report I just ran to try and figure out this thing, and this is what it gave me. There are so many ways you can do this. I've done it as a Slack channel: just a Slack channel that's my blog.

Confluence has a blogging platform, so you can run a blog on Confluence if you like. The key thing is you need to start one of these without having to ask permission first. You just start one day. You can do it in a Google Doc, right? You can have a Google Doc called Simon's Blog and add a new page to it every time you post a new entry. This works so well because it gives you a place where you can share knowledge without making any promises to keep it up to date in the future.

So there's very low commitment on your behalf, and it gives you credibility really quickly, because nobody else is doing it. If you're the only person in the entire organization publishing useful notes saying, "Hey, so I dug into the orders table and I figured out what these three things mean," that all becomes valuable.

And then, over time, if you can encourage other people to do that, that's fantastic. Something I did at one employer was build a search engine across the documentation, which it turned out existed in seven different systems, right? There were mailing list archives and wikis and GitHub repositories and support knowledge bases and so forth. And everyone at the company felt the documentation was terrible.

It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company.

CL: Exactly.

Simon: So there are lots of tricks like that. I think my tip would be, I mean, what's the name of this podcast? Data Renegades. Be a little bit renegade, right?

Dori: Yeah, yeah.

Simon: Be a bit renegade. Start writing documentation. Don't promise that it's going to stay up to date in the future. Give it timestamps. But I think you can have a really big impact by doing that.

Dori: Well, and something about the blog that you mentioned: it's a lot of visibility, and a lot of data teams struggle with visibility and proving value if they're non-revenue-generating. That can be a really good way to put your name out there and get it into the organization in a way that's very visible.

Simon: Absolutely.

Dori: And not just like when something breaks.

CL: Yeah. You touched on two types of documentation. One is the kept truth around the data that the data org manages. The other is really living, almost materialized tacit knowledge. I'm a big fan of your TIL blog. You just blog about all the things you play with, right? And then something might come out of that.

Simon: And that works internally as well. I've got a few different blogs I'm writing, but there's one called TIL, which stands for Today I Learned, and the barrier for writing something on there is: I just figured something out. That's it. So I've done TILs about for loops in Bash, right? Because, okay, everyone else knows how to do that. I didn't.

So I get to knock up a quick thing saying, here, I figured out how to do this. Partly it's my public notes, right? I'm taking notes anyway, why not stick them online? And part of it is a sort of value statement I'm making: even if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out for loops in Bash. There's no shortage of new things to learn across the entire space.

And my internal blog at companies has sometimes been full of those little TILs: today I learned what the order service does, or today I learned why our internal ORM has this weird name. Stuff like that. People love it, right? And it's very inexpensive content, because with a TIL you're not trying to present radical new information to the world. You literally just figure something out, write it up in a couple of paragraphs, and stick it out into the world.

CL: Yeah. This is amazing. So I want to take us forward a little bit. I know you've been playing with LLMs intensively since they first appeared. But fast forward five years: what's going to feel laughably outdated about how we handle data today?

Simon: So that question is impossible, because of the rate at which the AI space is moving. I can't predict what's going to happen. Maybe two months out I could do, but longer than that, I've got no idea.

I haven't thought about this quite as much in the data space; I've been focusing so much on the AI-assisted programming side of things, which has gotten super interesting this year. Last year, LLMs could write you a Python function if you asked for one. The big change this year is that back in February, Claude Code came out, and since then it's been joined by all of these coding agent tools. And all those are is the realization that if you take a good LLM and stick it in a loop with a terminal where it can run commands and run code, it can solve problems for you, because it can use trial and error. It can experiment, it can read error messages and fix them, and so forth.

And that stuff has gotten so good in the past six months. What's interesting about those, and I'll talk about Claude Code because it's the one I've used the most, is that they pretend to be programming tools, but actually they're basically a form of general agent, because they can do anything you can do by typing commands into a Unix shell, which is everything, right?

There are so many things you can do with that. I have them running FFmpeg for me, and they can interact with APIs and do all of this sort of stuff. And the obvious application there is data, right? Data cleaning is basically a case of "curl this thing, download it here, load it up into Pandas, load it into this one." They're so good at that stuff right now.
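
The loop he's describing is conceptually tiny. A hand-wavy sketch, where ask_model() stands in for whatever LLM API you use (it is not a real library call), and which you would only ever run inside a sandbox:

```python
# Sketch of the "LLM in a loop with a terminal" idea. ask_model() is a
# stand-in, not a real library call. Running model-proposed shell
# commands is dangerous outside a sandbox.
import subprocess

def ask_model(transcript: str) -> str:
    """Return the next shell command to try, or 'DONE'. Stubbed out here."""
    raise NotImplementedError

def agent(task: str, max_steps: int = 10) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        command = ask_model(transcript)
        if command.strip() == "DONE":
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        # Feed stdout and stderr back so the model can read error
        # messages, fix them, and try again.
        transcript += f"$ {command}\n{result.stdout}{result.stderr}\n"
    return transcript
```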

I just pitched a talk to the data journalism conference next year that's basically coding agents for data: what can we do with these new things? A couple of weeks ago, Anthropic released this new thing called Skills for Claude. And all a skill is is a markdown file that says "here's how to do X," which the agent will read when it needs to.

But I'm thinking, okay, imagine a markdown file for census data: here's where to get census data from, here's what all of the columns mean, here's how to derive useful things from it. And then you have another skill for how to visualize things on a map using D3 or whatever JavaScript library, and another skill that says:

"At the Washington Post, our data standards are this and this and this, and we always double check these things." Feed those into Claude Code out of the box, and it can probably do a decent job of being an assistant to the data reporter, taking on the sort of data cleaning work that takes time and doesn't give you the value. That's exciting.
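
A sketch of what such a census skill might look like. The shape loosely follows Anthropic's published SKILL.md format (markdown with a small YAML frontmatter); every field and instruction below is invented for illustration:

```python
# Hypothetical data-journalism skill: just a markdown file the agent
# reads on demand. All of the content here is invented.
from pathlib import Path

skill = Path("skills/census-data/SKILL.md")
skill.parent.mkdir(parents=True, exist_ok=True)
skill.write_text("""\
---
name: census-data
description: How to fetch and interpret US census data for stories
---
- Download tables from data.census.gov; prefer the 5-year ACS estimates.
- B01003_001E is total population; always note the margin of error.
- Derive per-capita figures before comparing towns of different sizes.
""")
```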

I think that's available right now, today. We've been so distracted by the coding side of it that we haven't really started digging into it. So in five years' time, heck, in a year's time, I expect most people who work with data professionally will be using a tool along those lines: something that can do so much of that fiddly loading of things into a Pandas DataFrame, wrangling these columns, and all of that kind of stuff.

I feel like that's going to be a solved problem, to the point that the level we operate at as data analysts just goes up. We're not writing the cleaning script ourselves by hand; we're liberated. I find this as a programmer already: I'm liberated to think at a high level about the code I'm writing, because I don't have to type every line of Python myself.

CL: Right. This is super exciting. And I think there are two important parts to that. One is the generic agent loop; with Claude Code and Skills you can do a lot of things. But there are other attempts, like a Cursor for data. There are a couple of startups trying to build a data-specific IDE.

Do you think we still need a data-specific IDE, or would a generic agent loop with data skills solve it?

Simon: It's an interesting question, because the thing about Claude Code is it's a terminal app. It's like Vim, right? And I remember when Claude Code came out in February, I was like, well, this is a cool hack that nerds will understand and nobody else in the world will get. This will have a tiny audience. I was so wrong, right?

The terminal is now accessible to people who never learned the terminal before, because you don't have to remember all the commands; the LLM knows the commands for you. But isn't it fascinating that the cutting-edge software right now is, like, 1980s style--

CL: Back to the terminal.

Simon: I love that. It's not going to last, right? That's a current absurdity for 2025. But there's so much you can do with richer interfaces on top of this.

LLMs are great at spinning up little custom UIs when you need them to. If you're a data reporter, more of a notebook interface makes a lot more sense than a Claude Code-style terminal, because a Jupyter Notebook is effectively a terminal; it's just in your browser and it can show you charts.

So yeah, I expect we'll see the Claude Code agent-in-a-loop thing with a much richer environment, especially for data work, where visualization is such a key part of what you're doing. And I'm aiming to build aspects of that into my own Datasette project, now that I'm starting to build plugins that bring in the AI side of things.

CL: Wow. There are too many things to unpack.

Dori: This is just riffing off the data visualization part there. What do you think about the future of dashboards and BI tools, whose sole purpose is doing these data visualizations?

Simon: One of my favorite things to do with Claude Code, and even just ChatGPT and Claude on the web can do this: you copy and paste a big chunk of JSON data into them and you say, build me a dashboard. And they do such a good job. They'll just decide, oh, this has a time element, so we'll do a bar chart over time, and these numbers feel big, so we'll put those in a big green box and calculate a percentage.

I do this on almost a daily basis. Just the other day I built a little script, and all it did was scan through 200 different projects, run their unit tests, and output pass or fail. Then I got Claude to knock up a little artifact that I pasted the results into, and now I've got a big green "37 out of 520" on screen right in front of me. It cost me nothing to do, and it was delightful.

So I feel like the future of those BI dashboard tools is that they have to be prompt-driven. They just have to be. It makes so much sense that we can spin these things up in a much richer way than having people click through and configure hundreds of panels themselves.

Dori: Yeah. Well, and "delightful." I don't think that's a word I've ever heard anyone use for building a dashboard before, but what a bar to hit.

CL: Wow, "delightful"dashboard.

Dori: I've made so many, you know, knock on myself and knock on dashboards. I don't know which one, maybe both.

Simon: If you do it in Claude, you can then say, "It's good, but do it Barbie-themed," and it will, and now you've got your Barbie-themed dashboard. I use Claude Artifacts a lot, right? Claude Artifacts is the feature of Claude where it builds you a little one-off web app. They're quite restricted in what they can do.

They can't make API calls to other websites; they can load a specific set of JavaScript libraries and things. The trick with those is copy and paste. All of the interesting tools I've built are "build me a tool where I can copy and paste a hundred megabytes of CSV into it and it will do X."

So I've got this collection of those, and just the other day I built one for copying and pasting from your terminal, because I use Claude Code and it spits out all of this pretty syntax-highlighted terminal output, and I wanted to share that with other people. It turns out, at least in Firefox, if you copy from the terminal, the pasted data has weird RTF magic numbers and things in it.

So I got Claude to build me a JavaScript thing that takes that RTF and converts it into HTML. Then I figured out how to get it to write to a GitHub Gist, so I got it to add a little button that saves to a Gist, and there's a website you can use that renders the Gist.

So now it's paste, click, and I've got a shareable HTML version of my terminal. And at the last moment I was like, "Oh, and make it green on black, give it terminal vibes."

So now I've got this beautiful black screen with green writing that does all of this, because it's a terminal tool. It's just wildly entertaining.

CL: Yeah, wow, I'm so glad this is fun. Before we enter our lightning round, what's something you think we should have asked you but didn't?

Simon: We could talk about the most exciting applications of LLMs to data analysis, because there are three.

CL: Okay.

Dori: Tell us.

Simon: The first one is very obvious: writing your SQL for you. LLMs are stunningly good at outputting SQL queries, especially if you give them extra metadata about the columns, maybe a couple of example queries and such. So that's great, and lots and lots of people are building it. That text-to-SQL loop is almost table stakes for building any kind of data reporting tool now.
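
A minimal version of that loop, sketched with Simon's own llm Python library; the database, question, and model name are invented for illustration:

```python
# Text-to-SQL sketch: hand the model the schema, get back a query, run it.
import sqlite3
import llm

db = sqlite3.connect("trees.db")
schema = "\n".join(
    row[0]
    for row in db.execute("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL")
)

model = llm.get_model("gpt-4o-mini")
response = model.prompt(
    f"Schema:\n{schema}\n\n"
    "Write one SQLite SELECT query that answers: which five neighborhoods "
    "have the most trees? Reply with SQL only, no markdown fences."
)
print(db.execute(response.text().strip()).fetchall())
```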

A more interesting one, I think, and one of the most economically valuable applications of this tech outside of writing code for you, is data extraction. You've got a pile of 20,000 PDFs. This happens in data journalism all the time: you file a freedom of information request and you get back horrifying scanned PDFs at slightly wonky angles, and you have to get the data out of them.

LLMs for a couple of years now have been so good at "here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description," and they just do it. The ones that run on my laptop can do this trick really well. That's extraordinary, because data entry has been one of the most expensive parts of working with unstructured or messy data. They do it so well.

In my experience, they get it 95 to 98% right, which is an interesting challenge: what do you do about the remaining 2-5%? This has been a problem with human data entry as well. With human data entry, often you'll do spot checks, and you'll have multiple people enter the same documents. So there are patterns for this, but it is an enormously valuable application.
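
The extraction pattern itself is only a few lines. A sketch using the llm library's attachment support; the file, fields, and model are invented, and that remaining 2-5% is exactly why the output still needs spot checks:

```python
# Data extraction sketch: one scanned page in, one JSON record out.
import json
import llm

model = llm.get_model("gpt-4o-mini")
response = model.prompt(
    "Extract JSON with keys arresting_officer, incident_date, description. "
    "Reply with JSON only.",
    attachments=[llm.Attachment(path="police_report_page_001.png")],
)
record = json.loads(response.text())
print(record["arresting_officer"], record["incident_date"])
```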

And then the third one is what I'm calling data enrichment. This is a feature I've been building into my Datasette software: you've got a table of, say, 10,000 school reports, and there's some additional operation you want to run on all 10,000 of them at once. It might be data extraction; you might have a report where you want to pull out the key details and stick those in another database column.

Maybe you want to do an additional search to figure out, okay, what county was that school in? There's something really exciting, especially with the cheaper models, Gemini 2.5 Flash Lite, things like that, about being able to run those in a loop against thousands of records. That feels very valuable to me as well.
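
The enrichment pattern as a sketch: loop a cheap model over every row and write the answer back as a new column. Table, columns, and model name are invented; sqlite-utils is Simon's library, and the Gemini model assumes an installed llm plugin:

```python
# Enrichment sketch: run a model against thousands of rows in a loop.
import llm
import sqlite_utils

db = sqlite_utils.Database("schools.db")
model = llm.get_model("gemini-2.5-flash-lite")  # assumes an llm Gemini plugin

for row in db["schools"].rows:
    response = model.prompt(
        "Which US county is this school in? Answer with the county name only.\n"
        f"School: {row['name']}, {row['city']}, {row['state']}"
    )
    db["schools"].update(row["id"], {"county": response.text().strip()})
```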

CL: Right, so text-to-SQL is kind of solved; I think the Spider benchmark got to 75, 80, or something. And then you talked about extraction, and also augmentation, right? Enrichment, as in: given what we have in the database, go through each row, do something, augment it, add more columns. And this can all be a lot easier now.

Simon: Especially against images. The multimodal LLMs that can read images, audio, and video are getting so good. The Google Gemini models in particular: flawless audio transcripts. Interestingly, I think Gemini 2.0 could transcribe audio up to about 10 minutes, and then it would start getting it wrong. In 2.5 that's not a problem. And I only know that because I talked to a Google researcher who was like, "Yeah, did you spot that thing where after 10 minutes it gets a bit wrong?"

So many challenges keeping up with what these things can do, right? But the point being, we can now take audio and video and images, and they are incredibly cheap to process as well. At one point I calculated that using Google's least expensive model, if I wanted to generate captions for the 70,000 photographs in my personal photo library, it would cost me something like $13. Wildly inexpensive.

CL: Wow. Yeah, we couldn't have imagined this just two years ago.

Simon: Right. Absolutely.

CL: Simon, this has been a great conversation. Before we wrap, we'd like to put you through our data debug round.

Simon: Okay.

CL: Quick fire questions, short answers. Are you ready?

Simon: I think so.

CL: All right. First programming language you loved or hated?

Simon: I hated C++, because I got my parents to buy me a book on it when I was about 15, and I did not make any progress with the Borland C++ compiler. The first one I really loved was PHP, because I could build software in PHP and show it to my friends and they could use it, and that was amazing.

Actually, my first programming language was Commodore 64 BASIC, and I did love that. I tried to build a database in Commodore 64 BASIC back when I was about six years old.

And Datasette is named after the Commodore 64 cassette player, which was called the Datasette.

CL: Right, I see. Now it's all come full circle, right? Okay: tabs or spaces?

Simon: I'm a Python programmer. It's spaces. Python has settled on spaces.

CL: Great. And the biggest bug you've ever shipped to production that you can talk about?

Simon: When I was at The Guardian, we did an MPs' expenses crowdsourcing project. All of the MPs' expenses reports had come out, and we wanted our audience to dig through 17,000 pages and figure out which MPs had been spending their expenses on dodgy things and so forth.

I built it with a button you'd click to get a random page of expenses, and a progress bar on the homepage showing how far we'd gotten through. I tweeted a screenshot of that progress bar and said, "Hey, look, we have a progress bar." Thirty seconds later the site crashed, because I was using SQL queries to count all 17,000 documents just for that one progress bar.

So I had to rewrite the whole thing in Redis in the next hour and a half, because it turns out the signature feature of the website was its highest-traffic page, and getting everyone to click at once was a massive mistake.

Dori: All right. Next question. What's your go-to data set for testing?

Simon: I love the San Francisco Tree List. The City of San Francisco has an open data portal, and one of the files on that portal is called the Tree List. It's every tree that's managed by the city. So it's not the trees in Golden Gate Park, but it is all of the other trees.

There are 195,000 trees in this CSV file, and it's got latitude and longitude, species, age, when it was planted, who looks after it, all of this stuff. And get this: it's updated several times a week. I know this because one day I went to look at my favorite dataset of trees and it said "last updated today." And I'm like, huh, that's interesting.

So I set up a GitHub Actions scraper to grab a copy of it once a day to see if it's changed, and I've been tracking it for about four years now. Most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who. I've been trying to figure out which department is responsible for it.

But isn't that wonderful? Because of that commit history, right? There's so much in there: which neighborhood lost the most trees, which tree species get removed. That's awesome, right?
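
That "git scraping" pattern is small enough to sketch: fetch the file on a schedule and let the commit history record every change. The URL is a placeholder, and in practice this would run on something like a GitHub Actions cron schedule:

```python
# Git scraping sketch: re-download a file and commit it if it changed.
import subprocess
import urllib.request

URL = "https://example.com/Street_Tree_List.csv"  # placeholder, not the real endpoint

urllib.request.urlretrieve(URL, "trees.csv")
subprocess.run(["git", "add", "trees.csv"], check=True)
# git exits non-zero when there's nothing to commit, so don't use check=True here.
subprocess.run(["git", "commit", "-m", "Latest tree list"], check=False)
```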

Dori: I'm going to look that up. I live in the Mission, and there are multiple trees I want to check out. This is fascinating.

Simon: It's great. Yeah.

Dori: Okay. What's one lesson from outside of tech, sports, art, et cetera, that influences how you build?

Simon: I'm obsessed with showrunning TV shows. I'm absolutely fascinated by the art of running a TV show. This came from when I was an engineering manager with a team, and I was always looking for unconventional sources of advice on management.

And it turns out showrunners are writers, right? You're a screenwriter, you pitch a show, it gets picked up, and suddenly you're in charge of this overnight startup with a $10 million budget and a hundred employees, all artist creative types running around in different directions. You've got a very hard deadline for when the show has to ship.

You've got the network breathing down your neck, so you've effectively got activist investors. It's astonishingly challenging, and you're still expected to write the first and last episodes of the series as well. So you think running a tech startup is hard? Go and talk to a showrunner about what they're doing.

And there's this wonderful document called "The 11 Laws of Show Running," by a very experienced showrunner, full of tips. The best tip in there is that if you're a showrunner, there is so much you have to do. You have to build this world; there are sets that have to be designed, and costumes, and characterizations, and so forth.

It is impossible for you to micromanage all of it. The worst mistake a showrunner can make is trying to micromanage. So what you do instead is you have lieutenants, your writing staff or your senior writers, and your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set designers and all of those kinds of things.

And they can make sure that vision spreads. I love that as a model for how to manage a large team. I used to sniff at the idea of a vision when I was young and stupid, and now I'm like, no, the vision really is everything. Because if everyone understands the vision, you can delegate to them and they can go and make good decisions on your behalf.

Dori: That is really interesting.

CL: Wow. We'll link that in the show notes. This is massive-scale cat herding, right?

Simon: Exactly, what a wonderfully complex thing to try and pull off. And the great thing about showrunners is they're really good writers, so anything they write about showrunning is going to be worth reading.

Dori: Okay, what is one hot take about data that you're willing to defend on a podcast?

Simon: I'm going to say your stuff should be in Git. I think it's inexcusable to have executable code that has business value that isn't in version control somewhere. That's it. And it's difficult: somebody needs to teach you how to do this stuff, and you need to have that culture. But that's the hot take I will passionately defend.

CL: What's the nerdiest thing you've automated in your own life?

Simon: Okay, I've got a confession. My blog does very well on Hacker News, and I have quite a bit of automation related to that, which helps. I don't have bot armies voting up my posts or anything, but I do have quite a complicated GitHub Actions scraping setup to alert me when my stuff shows up on Hacker News. There's a page on Hacker News that shows you all of the posts from a certain website, but it doesn't have an API.

So I've got a GitHub Actions workflow that runs a piece of software I wrote called shot-scraper, which uses Playwright to load up a browser in GitHub Actions, scrape that webpage, and turn the results into JSON, which then gets turned into an Atom feed that I subscribe to in NetNewsWire. As a result, anytime one of my articles gets posted to Hacker News, I know within about half an hour. And that's useful. It means I can always engage with the conversations and so forth.

The ambitious one I haven't built yet is at our house in Half Moon Bay. I can see the Pacific Ocean out of the window, and occasionally there are whales. I want something that tells me when there's a whale.

CL: Wow.

Simon: So I want to point a camera at the ocean, take a snapshot every minute, feed it into Google Gemini or something, and just ask: is there a whale, yes or no? That would be incredible, right? I want push notifications when there's a whale.
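
The whole detector could be a sketch this small, with the camera capture and the notification stubbed out; the model name assumes a Gemini plugin for the llm library:

```python
# Hypothetical whale detector: snapshot the ocean once a minute and ask
# a vision model a yes/no question. snapshot() is stubbed out.
import time
import llm

model = llm.get_model("gemini-2.5-flash")  # assumes an llm Gemini plugin

def snapshot() -> str:
    """Grab a frame from the camera, return the image file path. Stubbed."""
    raise NotImplementedError

while True:
    response = model.prompt(
        "Is there a whale visible in this image? Answer YES or NO.",
        attachments=[llm.Attachment(path=snapshot())],
    )
    if response.text().strip().upper().startswith("YES"):
        print("Whale! Send a push notification here.")
    time.sleep(60)
```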

CL: Right, right. And did you register the domain already? Isthereawhale.com or something?

Simon: That one I have not, no. But I just need to build this one now. The ocean is a 10-minute walk away, so I need a good lens pointing at it.

CL: Amazing. Final one: favorite podcast or book that's not about data or tech?

Simon: Favorite podcast is easy. There's a comedian in England called Mark Steel, and he does a radio show, available as a podcast, called Mark Steel's in Town, where every episode he goes to a small town in England and does a comedy set in a local venue about the history of the town. He does very deep research; he figures out everything.

And it's one of those things where all of the jokes are in-jokes, but you still get them. He'll make a joke about the next town over and everyone will roar with laughter. And I love that sort of hyperlocal comedy, that sort of British culture thing. It's really good. It's Mark Steel's in Town, on Radio 4. So yeah, that's my favorite podcast, I think.

CL: Any books?

Simon: My favorite genre of fiction is British wizards who get caught up in bureaucracy. And it turns out there's a whole bunch of it. Charles Stross writes "The Laundry Files," which is about British wizards in a sort of MI5 for magic. Ben Aaronovitch does the "Rivers of London" series, which is about a Metropolitan Police officer who deals with magic. I just really like that contrast of magical realism with very clearly researched government paperwork and filings.

CL: Wow. This is amazing.

Dori: Sounds very British too.

Simon: Oh, it really is. Yeah.

CL: Well, this has been a great conversation. Thank you, Simon. Before we go, where can listeners find you, and how can they be helpful to you?

Simon: All of my stuff is on simonwillison.net. That's my blog, updated most days with stuff that I'm doing. Other than that, I'm on Bluesky and Mastodon and Twitter, and I've got a ridiculously active GitHub profile. I've just hit a thousand repos on GitHub, many of which are actively maintained, in as much as if you report a bug on one and it happens to make it through my email inbox, I will go and fix it.

CL: You don't have an agent fixing those bugs automatically yet?

Simon: Almost. I'm getting there. I've started pasting URLs to GitHub issues into Claude Code and saying "fix it," and it does. Quite often, like a 75% hit rate on fixing bugs just by pasting the URL into Claude. That's cool. That's very hopeful.

Dori: That is really cool.

CL: We live in strange times.

Simon: We really do.

Dori: Well, thank you so much again, Simon. This has been a fantastic conversation. Really appreciate you taking the time.

Simon: Thanks. This has been really fun.